Predictive Modeling with Heterogeneous Sources
Authors
Abstract
A lack of labeled training examples is a common problem in many applications. At the same time, labeled data from related tasks is often abundant, although these tasks have different distributions and outputs (e.g., different class labels or different scales of regression values). In the medical domain, for example, we may have only a limited number of vaccine efficacy examples for a new H1N1 swine flu epidemic, whereas a large amount of labeled vaccine data exists from previous years’ flu. However, it is difficult to use the older flu vaccine data directly as training examples because the data distribution and the efficacy output criteria differ between viruses. To increase the sources of labeled data, we propose a method that makes use of such examples, whose marginal distributions and output criteria can differ from those of the target task. The idea is to first select a subset of source examples similar in distribution to the target data; all selected instances are then “re-scaled” and assigned new output values from the label space of the target task. A new predictive model is then built on the enlarged training set. We derive a generalization bound that specifically accounts for the distribution difference, and we further evaluate the model on a number of applications. For an siRNA efficacy prediction problem, we extract examples from 4 heterogeneous regression tasks and 2 classification tasks to learn the target model, and achieve an average improvement of 30% in accuracy.
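The select, re-scale, and enlarge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the selection criterion or the re-scaling rule, so the sketch assumes nearest-neighbor Euclidean distance for selection and a linear min-max mapping for re-scaling, and it covers only the regression case. All function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def select_similar(source_X, target_X, keep_fraction=0.5):
    """Return indices of the source rows closest to any target row.

    Assumption: "similar in distribution" is approximated here by the
    Euclidean distance to the nearest target example.
    """
    # Pairwise differences, shape (n_source, n_target, n_features).
    diffs = source_X[:, None, :] - target_X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = dists.min(axis=1)
    n_keep = max(1, int(round(keep_fraction * len(source_X))))
    return np.sort(np.argsort(nearest, kind="stable")[:n_keep])


def rescale_outputs(source_y, target_y):
    """Linearly map the source output range onto the target output range.

    Assumption: a min-max mapping stands in for the paper's re-scaling.
    """
    source_y = np.asarray(source_y, dtype=float)
    target_y = np.asarray(target_y, dtype=float)
    s_min, s_max = source_y.min(), source_y.max()
    t_min, t_max = target_y.min(), target_y.max()
    if s_max == s_min:
        # Degenerate source range: fall back to the target mean.
        return np.full(source_y.shape, target_y.mean())
    return t_min + (source_y - s_min) * (t_max - t_min) / (s_max - s_min)


def build_enlarged_training_set(source_X, source_y, target_X, target_y,
                                keep_fraction=0.5):
    """Combine target examples with selected, re-scaled source examples."""
    idx = select_similar(source_X, target_X, keep_fraction)
    new_y = rescale_outputs(source_y, target_y)[idx]
    X = np.vstack([target_X, source_X[idx]])
    y = np.concatenate([np.asarray(target_y, dtype=float), new_y])
    return X, y, idx
```

Any standard regressor could then be fit on the returned `X` and `y`. Handling source tasks with different class labels, which the abstract also mentions, would need a separate label-assignment step that this sketch does not attempt.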
Similar resources
A Practical Activity Capture Framework for Personal, Lifetime User Modeling
This paper addresses the problem of capturing rich, long-term personal activity logs of users’ interactions with their workstations, for the purpose of deriving predictive, personal user models. Our architecture addresses a number of practical problems with activity capture, including incorporating heterogeneous information from different applications, measuring phenomena with different rates of...
Scaling Access to Heterogeneous Data Sources with Disco
Anthony Tomasic, Louiqa Raschid and Patrick Valduriez. Draft; see TKDE 1998 for the final version. Accessing many data sources aggravates problems for users of heterogeneous distributed databases. Database administrators must deal with fragile mediators, that is, mediators with schemas and views that must be si...
Bi-level multi-source learning for heterogeneous block-wise missing data
Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feat...
Facies Modeling of Heterogeneous Carbonates Reservoirs by Multiple Point Geostatistics
Facies modeling is an essential part of reservoir characterization. The connectivity of the facies model is critical for the dynamic modeling of reservoirs. Carbonate reservoirs are so heterogeneous that variogram-based methods like sequential indicator simulation are not very useful for facies modeling. In this paper, multiple point geostatistics (MPS) is used for facies modeling in one of th...
Learning Relational Bayesian Classifiers on the Semantic Web
With the advent of the Semantic Web, there is an increased availability of metadata (ontologies) that make explicit the semantic commitments associated with data, and an urgent need for machine learning algorithms for building predictive models from such data. Usually, there is no unique global interpretation of data from semantically disparate, autonomous sources. Furthermore, it is neither fe...
Publication date: 2010